We classify twenty-one Indo-European languages starting from written text. We use neural networks in order to define a distance among different languages, construct a dendrogram and analyze the ultrametric structure that emerges. Four or five subgroups of languages are identified, according to the “cut” of the dendrogram, drawn with an entropic criterion. The results and the method are discussed.

I. INTRODUCTION AND MOTIVATIONS

The identification and classification of languages is a complex and interesting subject that has roots in many areas of science and the humanities. This problem can be tackled in different ways. Here we try an approach that hinges on written text and makes use of neural networks. There is a vast literature on this and related subjects, ranging from artificial intelligence and computer science [1] to linguistics [2] and the physics literature [3]. We shall rely on physical and mathematical concepts and ideas.

Our main objective is to define a distance among a given set of languages and identify subgroups of languages that are more similar. In our case study, the set is made up of the 21 Indo-European languages listed in Fig. 1. The strategy consists in asking a feed-forward neural network to distinguish them two by two. The error will be taken as a measure of the similarity between the two languages: a large error signifies closeness between the two languages, while a small error indicates a large separation.

The main idea is to simulate the learning stage of an inexperienced speaker (or even a child). Our starting point is the assumption that a beginner who is learning (say) English will find German more familiar (closer) than Portuguese. If asked to discriminate English from German, she/he/it will make more mistakes than when discriminating English from Portuguese. We shall mimic the above-mentioned learning stage through an artificial neural network. The use of the latter in cognitive sciences [4] and in fields of investigation that can be considered “complex” is often interdisciplinary. Among related areas of study there are artificial intelligence inspired by biology [5], DNA sequence classification, and disease classification and diagnosis [6].

Our focus will be on physical ideas and concepts, and in particular on how physical notions, such as that of entropy, enable one to identify and classify subgroups of languages that share some similarity. Since our method hinges on written text, the rules of alphabet transcription will play an important role and (presumably) have an influence on the final classification [7]. This motivated us to look at a group of 21 Indo-European languages written in Roman characters, with two exceptions: Greek and Maltese. Maltese is written in Roman characters but, unlike the other languages in the group, is a Semitic language. Greek is Indo-European, but is written in a different script. It will be interesting to see which of these aspects prevails and how these two languages are classified.

FIG. 1: Languages and acronyms.


This article is organized as follows. In Sec. II we introduce the method of investigation, based on neural networks. Languages are discriminated with some error, and this enables one to define a distance. The features of this error are analyzed in Secs. III and IV, where the sentences are modified and manipulated in order to test the method of discrimination. In Sec. V we recall the mathematical definition of distance and describe the linkage algorithm, introducing the cophenetic coefficient. We proceed to classify languages in Sec. VI and discuss the tree and ultrametric structure that emerge in Sec. VII. We conclude in Sec. VIII. The methods utilized, the learning phase and the code are detailed in Appendix A.

II. NEURAL NETWORK

A. Method 

We use a feed-forward neural network with back-propagation [8]. The objective of each “run” is to discriminate two given languages of the set. The neural network is fed with 20,000 sentences (10,000 per language) of different lengths, taken from the Leipzig Corpora Collection.

The first 40 characters are extracted from each sentence; spaces, accents, numbers and punctuation are removed and special (non-ASCII) characters are replaced by their ASCII counterpart (e.g. French and Spanish “ç” is replaced by “c” and German “ß” by “ss”). These replacements make languages more similar and language discrimination more difficult, so that the task of the neural network is harder.
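The preprocessing just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalize_sentence` is ours, we assume that the text is cleaned first and then cut at 40 letters (as the example in the caption of Fig. 2 suggests), and letters that have no canonical decomposition (such as “ø”) are simply dropped here, whereas the actual code may map them differently.

```python
import unicodedata


def normalize_sentence(sentence, n=40):
    """Return the first n plain ASCII letters (a-z) of a sentence."""
    # Lower-case and expand the German sharp s, which has no decomposition.
    text = sentence.lower().replace("ß", "ss")
    # Split accented letters into base letter + combining mark (ç -> c + cedilla).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only a-z: spaces, digits, punctuation and combining marks are dropped.
    # Assumption: letters without a decomposition (e.g. ø) are dropped as well.
    letters = [ch for ch in decomposed if "a" <= ch <= "z"]
    return "".join(letters[:n])


print(normalize_sentence("Alice thought she might as well wait, as she had nothing else to do"))
print(normalize_sentence("Français Straße"))
```

The second call shows the two replacements quoted in the text: the cedilla is stripped and “ß” becomes “ss”.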

Each character is encoded in a 26-dimensional vector [10, 11]. Other types of input encoding offer no guarantee on the correct association of the synaptic weights to different characters, nor on the soundness of the ensuing language discrimination. We avoided the use of features extracted from sentences, which rely on a priori assumptions and can introduce some bias. (We are interested in classifying data when no a priori knowledge is given or available, as in some phylogenetic analyses of biological sequences [12].)
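A minimal sketch of such an encoding follows. We assume here a one-hot scheme (one entry equal to 1 per character, all others 0), which is the natural reading of a 26-dimensional vector per letter but is not stated explicitly in the text; the name `encode_sentence` and the choice of leaving the trailing positions at zero for sentences shorter than n are ours.

```python
import string

ALPHABET = string.ascii_lowercase  # the 26 letters a-z


def encode_sentence(sentence, n=40):
    """Map a cleaned sentence to a flat list of 26 * n zeros and ones."""
    vector = [0.0] * (len(ALPHABET) * n)
    for position, letter in enumerate(sentence[:n]):
        # One block of 26 entries per character; a single entry is switched on.
        vector[position * len(ALPHABET) + ALPHABET.index(letter)] = 1.0
    return vector


x = encode_sentence("alicethoughtshemightaswellwaitasshehadno")
print(len(x), sum(x))
```

For n = 40 the resulting vector has 26 × 40 = 1040 entries, one per input node.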

The neural network is made up of three layers. The first layer contains 1040 (= 26 × 40) input nodes, the central layer 500 nodes, the output layer 2 nodes. Each time, one language is discriminated against another language: the two output nodes give the probability of correctly interpreting the input. The scheme is depicted in Fig. 2.

The results are averaged according to a 5-fold cross-validation technique. Specifically, our validation set is made up of 4,000 sentences. The neural network makes an error when, if fed with language A, it mistakenly answers B (or vice versa). A training period, which minimizes an appropriate cost function, is repeated on the remaining 16,000 sentences, until the error on the validation set is steady. In this way, a set of 5 errors is obtained. The labels A and B are then interchanged and, after a new training period, 5 additional errors are obtained. The error e used in the following is the average over these 10 runs. The training and the code utilized are detailed in Appendix A.
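The bookkeeping of the cross-validation can be sketched as follows. The helper names `five_fold_splits` and `mean_error` are hypothetical, and we assume that the 20,000 sentences have already been shuffled, so that contiguous blocks of indices can serve as folds; the actual code may partition the data differently.

```python
def five_fold_splits(n_samples=20000, k=5):
    """Yield (training indices, validation indices) for each of the k folds."""
    fold_size = n_samples // k
    indices = list(range(n_samples))  # assumed to index already shuffled data
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        validation = indices[start:stop]
        training = indices[:start] + indices[stop:]
        yield training, validation


def mean_error(errors):
    """Average the errors collected over all runs (ten in the text)."""
    return sum(errors) / len(errors)


splits = list(five_fold_splits())
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each fold leaves 4,000 sentences for validation and 16,000 for training, as in the text.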


B. Praising errors 

The neural network described above works very efficiently, making language discrimination very effective. After the training period, the error is always small (typically a few percent, becoming at most 20% when the two languages are very similar).

FIG. 2: The discrimination scheme. A 40-character-long sentence is fed to the neural network, whose input, central and output layers consist of 1040 (= 26 × 40), 500 and 2 nodes respectively. After a training period, during which 16,000 sentences in two languages are fed to the network, the two output nodes yield the probability that a given sentence is classified as belonging to one of the two languages. Spaces, accents and punctuation are removed, and the sentence is always cut at 40 characters, so that “Alice thought she might as well wait, as she had nothing else to do, and perhaps after all it might tell her something worth hearing.” becomes “alicethoughtshemightaswellwaitasshehadno”.

We shall make use of the error e to define a distance d between the two languages, according to the formula

$$d = \frac{1}{e} - 1 \in \mathbb{R}_+ . \tag{1}$$

The above definition entails a certain degree of arbitrariness, as any monotonic function of 1/e would be equally suited. One advantage is that d takes values on the positive real half-line $\mathbb{R}_+ = [0, \infty)$.
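As a small numerical illustration, assuming that Eq. (1) reads d = 1/e - 1 and that the error e enters as a fraction in (0, 1] rather than as a percentage (otherwise d would not be confined to the positive half-line), one may write the following sketch. The function name `distance_from_error` is ours.

```python
def distance_from_error(error):
    """Eq. (1): d = 1/e - 1, with the error e given as a fraction in (0, 1]."""
    if not 0.0 < error <= 1.0:
        raise ValueError("the error must be a fraction in (0, 1]")
    return 1.0 / error - 1.0


# Errors of 20%, 5% and 1%: the smaller the error, the larger the distance.
for percent in (20.0, 5.0, 1.0):
    print(percent, distance_from_error(percent / 100.0))
```

The distance vanishes only when the network is wrong all the time (e = 1) and grows without bound as the error tends to zero.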

At each run, the neural network discriminates two given languages in the set. When the error and the distance between these two languages are obtained, a new run starts: the network is reset and a new pair of languages is used. This procedure yields a matrix of distances. Before embarking on the description of this task, we shall test the system and try to elucidate its functioning and characteristics.


III. ERROR VS INPUT LENGTH 

In the preceding section and throughout this article, the length n of the input sentences is taken to be n = 40. This choice is dictated by efficiency and by the duration of the computer simulation. In Figure 3 the error is plotted as a function of the length of the input sentence n, for n = 30, ..., 60. Here we are discriminating English vs Italian.

FIG. 3: English vs Italian: error e (percent) vs length n of input sentence. Upper panel: linear plot. Lower panel: logarithmic plot. The error bars are the standard deviations over the cross-validation ensemble of the neural network. (The point at n = 70 is not taken into account for the fit.)

The first layer of the neural network changes, because the number of input nodes N is taken to be

$$N = 26 \times n, \tag{2}$$

where n is the length of the input sentence. The middle and output layers do not change. The fit yields a power law

$$e = A n^{\gamma}, \tag{3}$$

with A = 128.59 ± 6.02 (percent) and $\gamma$ = -0.86 ± 0.07. This indicates that, given enough computational power, the error in discriminating two languages can be made arbitrarily small. In order to corroborate the analysis, after determining the parameters A and $\gamma$, we added a point at n = 70. This additional point is very accurately fitted, validating the ansatz (3).
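A power law of the form (3) becomes a straight line in a log-log plot, so its parameters can be estimated by ordinary least squares on the logarithms. The sketch below illustrates the idea; the function name `fit_power_law` is ours, the fit is unweighted (the fit in the text may have used the error bars), and the data points are synthetic values generated from the quoted central values of A and the exponent, not the measured errors.

```python
import math


def fit_power_law(lengths, errors):
    """Least-squares fit of log(e) = log(A) + gamma * log(n); returns (A, gamma)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), slope


# Synthetic points generated from the quoted central values, NOT measured data.
lengths = [30, 40, 50, 60]
errors = [128.59 * n ** -0.86 for n in lengths]
A, gamma = fit_power_law(lengths, errors)
print(A, gamma)
```

On exact power-law data the fit returns the generating parameters, which is a useful sanity check before applying it to noisy errors.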

One may wonder whether the coefficients of the fit ($\gamma$ in particular) are language-dependent. In Figure 4 the error is plotted as a function of the length of the input sentence n when the network discriminates French vs Swedish. The fit again yields a power law, but the coefficients are different.


IV. SCRAMBLING THE CHARACTERS 

One could conjecture that the neural network is simply counting the frequency of the letters in a given language. In order to test this hypothesis, we randomly “scrambled” the letters of the 20,000 input sentences (10,000 × two languages) and asked the network to discriminate the two languages.

The result for English vs Italian is displayed in Fig. 5: $\sigma = 1$ means that the sentence is randomly scrambled character by character, so that the neural network in this case simply “counts” the frequency of characters in each language; $\sigma = 2$ means that the sentence is scrambled in “cells” made up of pairs of characters, and so on; $\sigma = 40$ means that the sentence is not scrambled.

FIG. 4: French vs Swedish: error e (percent) vs length n of input sentence. Upper panel: linear plot. Lower panel: logarithmic plot. The error bars are the standard deviations over the cross-validation ensemble of the neural network.

The points plotted in Fig. 5 are (close to) the divisors of 40 (the length of the input sentences). When the division is not exact, a number of zeros is appended after the last character and removed at the end of the procedure. Example: for the point with abscissa $\sigma = 3$ we added two zeros at the end of the (40-character-long) sentence, so that 40 + 2 = 42 becomes divisible by 3, and there are 14 triples of characters; the two spurious zeros are removed at the end of the procedure.
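The scrambling procedure can be sketched as follows. The function name `scramble_cells` and the use of Python's `random` module are our choices; we assume that whole cells are permuted while the order of the characters inside each cell is preserved, and that the padding character “0” never occurs in the cleaned sentences (which contain only the letters a-z).

```python
import random


def scramble_cells(sentence, sigma, rng=None, pad="0"):
    """Randomly permute the cells of sigma characters that make up a sentence."""
    rng = rng or random.Random()
    # Pad with zeros so that the length becomes divisible by sigma.
    padded = sentence + pad * ((sigma - len(sentence) % sigma) % sigma)
    cells = [padded[i:i + sigma] for i in range(0, len(padded), sigma)]
    rng.shuffle(cells)
    # Remove the spurious zeros at the end of the procedure.
    return "".join(cells).replace(pad, "")


sentence = "alicethoughtshemightaswellwaitasshehadno"
print(scramble_cells(sentence, 3, random.Random(0)))
print(scramble_cells(sentence, 40, random.Random(0)))
```

With a cell size of 3 the sentence is padded to 42 characters and split into 14 triples; with a cell size of 40 there is a single cell and the sentence is returned unchanged.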
FIG. 5: Error e vs size $\sigma$ of the scrambled “cell”. The error bars are the standard deviation over 8 repetitions of the cross-validation ensemble of the neural network. (Italian vs English)

The whole procedure aims at understanding whether the network is able to detect the presence of longer-range “correlations” within a language and whether it makes use of this information when performing its discrimination task. The error clearly decreases monotonically with $\sigma$: the network detects longer-range correlations. At the same time, the statistics are not sufficient to draw clear-cut conclusions about the $\sigma$-dependence of e. For this reason, in Fig. 6 the same procedure was repeated when discriminating French vs Swedish. The trend is analogous.

FIG. 6: Error e vs size $\sigma$ of the scrambled “cell”. The error bars are the standard deviation over the cross-validation ensemble of the neural network. (French vs Swedish)


V. DISTANCE, LINKAGE AND DENDROGRAMS

We now briefly recall some mathematical notions. We start with the definition of distance, then describe the linkage algorithm, and finally introduce the cophenetic coefficient. We proceed to classify languages in the next section.


A. Distance 

A metric d on a set S of points is a non-negative map (distance)

$$d : S \times S \to \mathbb{R}_+ , \tag{4}$$

where $\mathbb{R}_+ = [0, \infty)$, endowed with the following properties, valid for each pair of points $x, y \in S$:

$$d(x,y) = 0 \iff x = y, \tag{5}$$

$$d(x,y) = d(y,x), \tag{6}$$

$$d(x,y) \le d(x,z) + d(z,y), \quad \forall z \in S. \tag{7}$$

A metric is called an ultrametric if it satisfies the following stronger version of the triangle inequality (7):

$$d(x,z) \le \max\{d(x,y), d(y,z)\}. \tag{8}$$

B. Linkage algorithms 


In order to classify the 21 languages listed in Fig. 1 and identify subgroups in the set, we need a linkage algorithm. Linkage algorithms yield a cluster structure that is displayed in the form of a tree or dendrogram [14]. We will adopt an agglomerative algorithm, linking the clusters through an iterative process. The original data set S is made up of m = 21 elements (points). At the first level (leaves of the dendrogram) the number of classes in the data set is m = 21. At the first iteration the two closest elements are clustered together, reducing the number of classes to m - 1 = 20 (if more than two elements are at the closest distance, we pick a random pair among them). At the second iteration one defines a new distance d' between the remaining elements of S and the first cluster formed. The distances are then recalculated and the two closest objects are joined. At the following iterations one must define a new distance among points and clusters. After m - 1 iterations, all the points are grouped together in one single cluster, corresponding to the whole data set.

A number of linkage algorithms can be adopted, which differ in the definition of “distance” between subsets of points. Such a “distance” is usually not a distance according to the mathematical definition given in Sec. V A, but rather a “dissimilarity” criterion [15]. This criterion can be given in a variety of different ways and entails some elements of arbitrariness. Some linkage algorithms make use of the coordinates of the points in the set, while others rely only on distances, and are to be preferred in our case.

Among the most common methods are the “single”, “complete”, “average”, Ward and Hausdorff linkage [16-18]. In our case study (see next section) all criteria yield similar results, although the single linkage suffers from the well-known chaining effect. We will display only the results obtained with the average linkage, as it behaves better in terms of the cophenetic coefficient, to be introduced in the following subsection.

The average linkage is based on the following definition of “distance” between two clusters U and V:

$$d(U,V) = \sum_{i \in U} \sum_{j \in V} \frac{d(i,j)}{|U|\,|V|}, \tag{9}$$

where the summations run over the elements of the clusters and |U| and |V| represent the cardinalities of clusters U and V. As mentioned before, d(U,V) is not a distance in the mathematical sense, but should rather be viewed as a “dissimilarity” index [15].
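The agglomerative procedure with average linkage can be sketched in plain Python as follows. This is a naive illustration, not the code used for the analysis: the function name `average_linkage` is ours, ties are broken by taking the first pair found rather than a random one, and the 4-point distance matrix is made up for the example.

```python
def average_linkage(dist):
    """Agglomerative clustering of points 0..m-1 from a symmetric distance matrix.

    Returns the list of merges as (cluster_a, cluster_b, height) tuples.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def dissimilarity(u, v):
        # Eq. (9): mean of d(i, j) over all i in U and j in V.
        return sum(dist[i][j] for i in u for j in v) / (len(u) * len(v))

    while len(clusters) > 1:
        # Find the closest pair of clusters (ties: first pair found).
        height, a, b = min(
            (dissimilarity(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Toy 4-point matrix (made-up numbers, not the language data).
toy = [
    [0.0, 1.0, 4.0, 4.0],
    [1.0, 0.0, 4.0, 4.0],
    [4.0, 4.0, 0.0, 2.0],
    [4.0, 4.0, 2.0, 0.0],
]
for step in average_linkage(toy):
    print(step)
```

For m points the loop performs m - 1 merges; the heights at which the merges occur are the levels of the dendrogram.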


Ultrametrics are very useful when one endeavors to unearth the structure of a (phylogenetic) tree, yielding a taxonomic classification of the points in the set S [13].

It is far from obvious that definition (1) satisfies the axioms (5)-(7) of a bona fide mathematical distance. This is even more so for property (8).


C. Cophenetic coefficient 

The cophenetic coefficient [19] is a measure of how faithfully a dendrogram preserves the pairwise distances between the points of the original data set. It enables one to judge how close the original metric is to an ultrametric. It is defined as a correlation coefficient between the original distance d and the cophenetic distance $d_c$ on the dendrogram, defined as the distance at which two leaf nodes i, j are clustered together.

FIG. 7: Error e and distance d, according to Eq. (1). The input language is in the first column, while the output language is in the first row. So, for example, the entry e(dan,isl) = 7.12 in the upper table is obtained by feeding (about) 2,000 Danish (dan) sentences to the network, which mistakenly recognizes them as Icelandic (isl) with probability 7.12% (average over 10 runs, see Sec. II A). By contrast, the entry e(isl,dan) = 6.52 is obtained by feeding (about) 2,000 Icelandic sentences to the network, which mistakenly recognizes them as Danish with probability 6.52%. The diagonal in the upper panel gives the mean percentage of correct identifications.

The cophenetic coefficient c is defined as

$$c = \frac{\sum_{i<j} \big(d(i,j) - \langle d \rangle\big)\big(d_c(i,j) - \langle d_c \rangle\big)}{\sqrt{\sum_{i<j} \big(d(i,j) - \langle d \rangle\big)^2 \, \sum_{i<j} \big(d_c(i,j) - \langle d_c \rangle\big)^2}}, \tag{10}$$

where $\langle d \rangle$ and $\langle d_c \rangle$ are the mean original distance and the mean cophenetic distance, respectively, and the sums run over all pairs of nodes of the dendrogram. A value of c close to 1 signifies the presence of a good ultrametric structure.

VI. DISTANCES AMONG LANGUAGES 

We are now ready to discriminate and classify languages, by utilizing the method described in the preceding sections. Errors e and distances d, defined according to Eq. (1), are displayed in Fig. 7 for the set of languages shown in Fig. 1. A few comments are in order.

Error and distance are not symmetric, so that property (6) is not always verified. This effect becomes more important when the error is close to zero, as this induces large distances, according to Eq. (1). This will affect the distances of Greek and Maltese in particular, as these two languages are the most “different” within the group.
FIG. 8: Distribution function of the quantity (12) for 1330 triads of distances taken from Fig. 7. The parameter (12) takes values in the range [-1, 2] and becomes negative if the triangle inequality (7) is violated.


Henceforth, the distance will be re-defined according to the symmetrized formula

$$d_s = \frac{2}{e_1 + e_2} - 1, \tag{11}$$

where $e_1$ and $e_2$ are the two entries (for each pair of languages) in the upper panel of Fig. 7.
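As an illustration, assume that the symmetrized formula (11) amounts to applying Eq. (1) to the mean of the two errors, that is, to 2/(e1 + e2) - 1; this reading is our assumption. The sketch below evaluates it for the two Danish/Icelandic entries quoted in the caption of Fig. 7, converted from percent to fractions. The function name `symmetrized_distance` is ours.

```python
def symmetrized_distance(e1, e2):
    """Assumed form of Eq. (11): Eq. (1) applied to the mean of the two errors."""
    mean_error = 0.5 * (e1 + e2)
    return 1.0 / mean_error - 1.0


# The two Danish/Icelandic entries quoted in the caption of Fig. 7, as fractions.
e_dan_isl, e_isl_dan = 0.0712, 0.0652
d_s = symmetrized_distance(e_dan_isl, e_isl_dan)
print(round(d_s, 2))
```

By construction the result does not depend on which of the two languages is taken as input, so that the symmetry property of a distance is restored.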

Let us now check the validity of the triangle inequality. The histogram in Fig. 8 displays the frequency distribution of the quantity

$$r = \frac{d(A,C) + d(C,B) - d(A,B)}{\max\{d(A,C),\, d(C,B),\, d(A,B)\}}, \tag{12}$$

where A, B and C are languages in the set and the denominator is the maximum of the quantities appearing in the numerator. The parameter r takes values in the range [-1, 2]. It is non-negative for a bona fide mathematical distance and becomes negative if the triangle inequality (7) is violated. The triangle inequality is not satisfied in about 17% of the cases. Its violation is usually small (r falls below -0.27 with probability 5.2%). These figures do not change much if one takes the unsymmetrized definition (1). We consider this a very satisfactory feature of the method we propose.
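The quantity (12) is straightforward to evaluate over all triads of a distance matrix; for 21 languages there are 21·20·19/6 = 1330 of them. In the sketch below the function names are ours, and since the text does not specify which language of a triad plays the role of A, B or C, we simply take them in index order; this is an assumption, and a different assignment changes the individual values of r.

```python
from itertools import combinations


def triangle_ratio(d_ac, d_cb, d_ab):
    """Eq. (12): (d(A,C) + d(C,B) - d(A,B)) divided by the largest of the three."""
    return (d_ac + d_cb - d_ab) / max(d_ac, d_cb, d_ab)


def triad_ratios(dist):
    """One value of r per triad, taking A < B < C in index order (our assumption)."""
    return [
        triangle_ratio(dist[a][c], dist[c][b], dist[a][b])
        for a, b, c in combinations(range(len(dist)), 3)
    ]


print(triangle_ratio(3.0, 4.0, 5.0))   # a proper triangle: r is positive
print(triangle_ratio(1.0, 1.0, 3.0))   # d(A,B) exceeds the detour via C: r is negative
```

A histogram of the list returned by `triad_ratios` is then the analogue of the distribution shown in Fig. 8.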

The diagonal in the upper panel of Fig. 7 displays the mean percentage of correct identifications (the probability of correctly identifying the input language). Unlike the off-diagonal entries, this average is evaluated on the whole dataset and therefore depends on the entire set of languages. A language that has many similar languages (within the given set) will have a lower mean percentage of correct identifications, while a language that is very different from the other languages in the given set will have a larger mean percentage of correct identifications. This will induce a “size of the point”, according to (1) or (11), in violation of property (5). Rather than considering this a negative aspect, we shall view this “size” as a measure of the “overlap” among the given language and other languages in the set. (This size is reproduced in Fig. 11 in the following.)


VII. TREE AND ULTRAMETRIC STRUCTURE 


Figure 9 displays the dendrogram obtained by starting from the symmetrized distance $d_s$ in Eq. (11) and then deriving a cophenetic distance $d_c$ via the average linkage procedure (9) outlined in Sec. V B. Calculation of the cophenetic coefficient (10) yields the value

$$c \simeq 0.82, \tag{13}$$

close to one, so that the “distances” $d_s$ in Eq. (11) are close to an ultrametric (8).

In order to unearth a taxonomic classification of the languages in the set one has to cut the dendrogram at a given level (ultrametric or cophenetic distance) $d_c$. Such a level can be reasonably chosen by relying on a stability criterion of the clustered solution. We therefore search for a stable partition among the hierarchy yielded by the clustering algorithm, in correspondence to an approximately constant value of the cluster entropy S in a certain range of $d_c$ [20],

$$S(d_c) = -\sum_{k=1}^{N_{d_c}} P_{d_c}(k) \ln P_{d_c}(k), \tag{14}$$

where $P_{d_c}(k)$ is the fraction of elements belonging to cluster k, $N_{d_c}$ the number of clusters at level $d_c$ in the dendrogram and $0 \le S(d_c) \le \ln 21 \simeq 3.04$.
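The cluster entropy (14) depends only on the sizes of the clusters obtained by cutting the dendrogram at a given level. A minimal sketch, with a function name of our choosing, reads:

```python
import math


def cluster_entropy(cluster_sizes):
    """Eq. (14): S = -sum_k P(k) ln P(k), with P(k) the fraction of elements in cluster k."""
    total = sum(cluster_sizes)
    fractions = [size / total for size in cluster_sizes]
    return sum(-p * math.log(p) for p in fractions if p > 0)


print(cluster_entropy([1] * 21))   # 21 singletons: the maximum, ln 21
print(cluster_entropy([21]))       # a single cluster: the minimum
```

The two calls reproduce the bounds quoted in the text: the entropy is largest at the leaves of the dendrogram and vanishes at the root. A plateau of S over a range of cutting levels signals that the partition does not change there.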


The entropy (14) is plotted in Fig. 10 as a function of $d_c$. The presence of two large adjacent plateaus is manifest. The first one is between $d_c$ = 17.82 and $d_c$ = 33.08. If one cuts the dendrogram here, one obtains three large clusters, made up of the Romance (or Latin), Germanic and Slavic languages, and two isolated languages, Greek and Maltese, the latter being a Semitic language (written in Latin script), descended from an Arabic dialect spoken in Sicily and Malta about one thousand years ago. English words make up 15-20% of the Maltese vocabulary.

The second plateau is between $d_c$ = 33.08 and $d_c$ = 44.79. If one cuts the dendrogram here, one obtains three large clusters, made up of the Romance, Germanic (including Maltese) and Slavic languages, and one isolated language, Greek.

The reluctance of Greek to join any cluster (see Fig. 9) could be ascribed to the fact that we are clustering written languages, in Latin script. For instance, both η and ε are translated into the ASCII character “e”, and both ω and ο (omicron) into the ASCII character “o”. Presumably, this has an influence on the classification. This influence seems to prevail over the differences of Maltese within the group. Notice also the very small “size” of the corresponding point in Fig. 11, indicating a small error in discriminating Greek within the given language set.
FIG. 9: The language tree.

FIG. 10: Cluster entropy across the language tree. The plateaus discussed in the text are indicated by red arrows.

In Fig. 11 we summarize our results by making use of a geographical representation. The size of the points is given in the diagonal of the lower panel of Fig. 7 and is interpreted as the “overlap” among the given language and the other languages in the set, according to the comments at the end of Sec. VI.


VIII. CONCLUSIONS 

We used a neural network approach to classify a group of 21 Indo-European languages from written texts. We defined a distance among languages, clustered them according to the average linkage procedure and obtained a dendrogram. An ultrametric structure emerged, with a cophenetic coefficient close to unity. By cutting the dendrogram according to an entropic criterion, five subgroups of languages were identified: Romance (or Latin), Germanic and Slavic languages, and two isolated languages, Greek and Maltese. Maltese clusters with the Germanic subgroup if the cut is moved, always in accord with the entropic criterion.

The method we proposed is less efficient for larger language sets. This can be ascribed to the arbitrariness of the definition of distance and the consequent difficulty in satisfying the axioms of distance (to a reasonable level of approximation). No a priori knowledge of the set's structure was given or used, in order to carefully avoid the introduction of unwanted bias. Possibly, the performance of the method could be improved by adding extra information.

FIG. 11: An artistic view of the linguistic classification. The circles refer to the languages in the data set of Fig. 1 (their size being discussed in Sec. VI). The shades include countries where the language is similar or spoken by part of the population. (Powered by Mapbox [22].)

Our approach might be useful in the fields of data mining and big data analysis and in particular natural language processing. Although we used a simple supervised learning setup, we managed to extract sensible language features from written text. Among future applications one might consider political and biophysical applications, DNA sequence and disease classification and the investigation of the structure of large complex corpora when no or little a priori knowledge of the structure is given or available. Another future challenge would consist in addressing the classification of spoken languages.
Appendix A: Learning and code

The code is written in SciPy [25] with the MPI4Py package for parallel execution [26]. It makes use of the Theano library [27, 28], originally developed in the field of deep machine learning [29] and optimized for complex tasks.

The model implemented is a single-hidden-layer multilayer perceptron (or artificial neural network, ANN), mathematically represented as a function $f : \mathbb{R}^D \to \mathbb{R}^L$ (with D = 1040 inputs and L = 2 outputs in our case),

$$f(x) = G\big(b^{(2)} + W^{(2)}\, s\big(b^{(1)} + W^{(1)} x\big)\big), \tag{A1}$$

where $b^{(1)}$ and $b^{(2)}$ are bias vectors, $W^{(1)}$ and $W^{(2)}$ weight matrices, and G and s activation functions. In our case the activation function s is tanh and G is a softmax function. The training process is a supervised learning with back-propagation [8] (learning rate lr = 0.01) based on stochastic gradient descent on mini-batches of 20 instances, repeated for at most E = 200 epochs. The cost function is a negative log likelihood function with two regularization hyper-parameters $l_1$ = 0.001 and $l_2$ = 0.0001, related to the $L_1$ norm and squared $L_2$ norm of the weight matrices $W^{(1)}$ and $W^{(2)}$, which prevent an uncontrolled increase of the weights. At every epoch, for every mini-batch, the cost function is evaluated and the ANN parameters are corrected according to the gradient of the cost function, with the learning rate lr as a multiplying factor. At the end of an epoch, the error between prediction and actual result is computed against the validation set: if it is smaller than the best value found so far, it is recorded. Before the 100th epoch, the error e typically becomes steady and the results on the validation set are considered reliable. See Figure 12. The entire process is repeated through a 5-fold cross-validation technique, also by interchanging languages A and B as depicted in Fig.
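To make the structure of the model concrete, the following sketch evaluates the forward pass (A1) and the regularized cost with plain Python lists. It is not the Theano code used for the analysis: the function names `forward` and `cost` are ours, the network sizes are tiny made-up values instead of 1040-500-2, the weights are random rather than trained, and the cost is written for a single instance rather than averaged over a mini-batch.

```python
import math
import random


def forward(x, W1, b1, W2, b2):
    """Eq. (A1): f(x) = softmax(b2 + W2 tanh(b1 + W1 x)), with plain Python lists."""
    hidden = [
        math.tanh(bias + sum(w * xi for w, xi in zip(row, x)))
        for row, bias in zip(W1, b1)
    ]
    scores = [bias + sum(w * h for w, h in zip(row, hidden)) for row, bias in zip(W2, b2)]
    # Softmax, shifted by the largest score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return [e / sum(exps) for e in exps]


def cost(probabilities, label, W1, W2, l1=0.001, l2=0.0001):
    """Negative log likelihood of the true label plus L1 and squared-L2 penalties."""
    weights = [w for matrix in (W1, W2) for row in matrix for w in row]
    penalty = l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)
    return -math.log(probabilities[label]) + penalty


# Tiny made-up sizes (6 inputs, 4 hidden nodes, 2 outputs) instead of 1040-500-2.
rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(4)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
b1, b2 = [0.0] * 4, [0.0] * 2
p = forward([1, 0, 0, 0, 1, 0], W1, b1, W2, b2)
print(p, cost(p, 0, W1, W2))
```

The two outputs are non-negative and sum to one, so they can be read as the probabilities assigned to the two languages; training would adjust the weights and biases along the gradient of the cost.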
FIG. 12: Neural network learning. Error vs epoch of training. Upper panel: English vs Italian. Lower panel: French vs Swedish. Notice that the error on English (namely the percentage of English sentences that are mistakenly classified as Italian) is higher than the error on Italian. By contrast, the error of French vs Swedish is almost symmetric.














